# Load the data for the exercises
library(openintro)
library(dplyr)
data(email50)
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time <dttm> 2012-01-04 21:19:16, 2012-02-17 04:10:06, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number <fct> small, big, none, small, small, small, small, sma...
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == 'big')
# Table of the number variable
table(email50_big$number)
##
##  none small   big
##     0     0     7
# Drop levels
email50_big$number <- droplevels(email50_big$number)
# Another table of the number variable
table(email50_big$number)
##
## big
##   7
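Filtering a factor keeps all of its original levels, which is why the first table above still lists none and small with zero counts. A minimal sketch in base R, using a made-up factor (sizes is an illustrative name, not part of email50):

```r
# Toy factor with three levels (illustrative data, not from email50)
sizes <- factor(c("small", "big", "none", "big"),
                levels = c("none", "small", "big"))

# Subsetting keeps every original level, even the ones no longer present
big_only <- sizes[sizes == "big"]
levels(big_only)

# droplevels() removes the unused levels
big_only_dropped <- droplevels(big_only)
levels(big_only_dropped)
```

Dropping unused levels matters mostly for tables and plots, where empty categories would otherwise still appear.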
# Syntax: ifelse(logical test, value if true, value if false)
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, 'below median', 'at or above median'))
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
## # A tibble: 2 x 2
## num_char_cat n
## <chr> <int>
## 1 at or above median 25
## 2 below median 25
The median marks the 50th percentile, or midpoint, of a distribution,
so half of the emails should fall in one category and the other half in the other.
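The even split is not guaranteed in general: if several values are tied at the median, the "at or above median" group can be larger. A small sketch with made-up vectors (num_char_toy and tied are illustrative names) shows both cases:

```r
# No ties at the median: the split is even
num_char_toy <- c(2, 9, 4, 7, 1, 8)
med <- median(num_char_toy)
num_char_cat <- ifelse(num_char_toy < med, "below median", "at or above median")
table(num_char_cat)

# Ties at the median: values equal to the median all land in "at or above"
tied <- c(1, 2, 2, 2, 3)
tied_cat <- ifelse(tied < median(tied), "below median", "at or above median")
table(tied_cat)
```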
# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(number_yn = case_when(
    number == 'none' ~ "No",  # if number is "none", make number_yn "No"
    number != 'none' ~ "Yes"  # if number is not "none", make number_yn "Yes"
  )
  )
# Visualize number_yn
library(ggplot2)
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()
Reference: Data Visualization with ggplot2 (I), (II), (III)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
ggplot2 automatically creates a helpful legend for the plot,
telling you which color corresponds to each level of the spam variable.
A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.
Q: What type of study is this?
A: Experiment.
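What makes this an experiment is that the researchers, not the volunteers, decide who reads which font, and they decide it at random. A minimal sketch of such an assignment in base R, assuming 20 volunteers split evenly (the volunteer labels and group size are made up for illustration):

```r
set.seed(42)  # only so the illustration is reproducible

# Hypothetical volunteer labels
volunteers <- paste0("volunteer_", 1:20)

# Shuffle a balanced vector of conditions so each font gets 10 readers
font <- sample(rep(c("Arial", "Helvetica"), each = 10))

assignment <- data.frame(volunteer = volunteers, font = font)
table(assignment$font)
```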
# Glimpse the gapminder data
library(gapminder)
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
Since there is no way to randomly assign countries to attributes,
this is an observational study.
One of the early studies linking smoking and lung cancer compared patients who were already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, the proportions of smokers among patients with and without lung cancer were compared.
Q: Does this study employ random sampling and/or random assignment?
A: Neither random sampling nor random assignment.
Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study.
Random sampling is not employed because the study records the patients who are already hospitalized,
so it wouldn't be appropriate to apply the findings back to the population as a whole.
Here we can see that the trend between x1 and y (gray dashed line) is reversed when x2 (the grouping variable) is considered.
If we don’t consider x2, the relationship between x1 and y is positive.
If we do consider x2, we see that within each group the relationship between x1 and y is actually negative.
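This kind of reversal is commonly called Simpson's paradox. A minimal sketch with made-up data (the toy data frame and its values are assumptions chosen to produce the effect) reproduces it by comparing the overall regression slope with the slope inside each group:

```r
# Made-up data: two groups, each with a downward trend,
# but group 2 sits higher on both x1 and y
toy <- data.frame(
  x1 = 1:10,
  x2 = rep(c("group 1", "group 2"), each = 5),
  stringsAsFactors = FALSE
)
toy$y <- ifelse(toy$x2 == "group 1", 5, 12) - 0.5 * toy$x1

# Slope of y on x1 when x2 is ignored
overall_slope <- coef(lm(y ~ x1, data = toy))[["x1"]]

# Slope of y on x1 within each level of x2
within_slopes <- sapply(
  split(toy, toy$x2),
  function(d) coef(lm(y ~ x1, data = d))[["x1"]]
)

overall_slope   # positive
within_slopes   # negative in both groups
```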
# Count male and female applicants by admission status (data: ucb_admit)
ucb_admission_counts <- ucb_admit %>%
  count(Gender, Admit)
ucb_admission_counts
## # A tibble: 4 x 3
## Gender Admit n
## <fct> <fct> <int>
## 1 Male Admitted 1198
## 2 Male Rejected 1493
## 3 Female Admitted 557
## 4 Female Rejected 1278
# Proportion of each gender admitted overall
ucb_admission_counts %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted")
## # A tibble: 2 x 4
## # Groups: Gender [2]
## Gender Admit n prop
## <fct> <fct> <int> <dbl>
## 1 Male Admitted 1198 0.445
## 2 Female Admitted 557 0.304
It looks like about 45% of males were admitted versus only 30% of females, but there’s more to the story.
# Proportion of females admitted for each department
ucb_admission_counts <- ucb_admit %>%
  # Counts by department, then gender, then admission status
  count(Dept, Gender, Admit)
ucb_admission_counts
## # A tibble: 24 x 4
## Dept Gender Admit n
## <fct> <fct> <fct> <int>
## 1 A Male Admitted 512
## 2 A Male Rejected 313
## 3 A Female Admitted 89
## 4 A Female Rejected 19
## 5 B Male Admitted 353
## 6 B Male Rejected 207
## 7 B Female Admitted 17
## 8 B Female Rejected 8
## 9 C Male Admitted 120
## 10 C Male Rejected 205
## # ... with 14 more rows
ucb_admission_counts %>%
  # Group by department, then gender
  group_by(Dept, Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for female and admitted
  filter(Gender == "Female", Admit == "Admitted")
## # A tibble: 6 x 5
## # Groups: Dept, Gender [6]
## Dept Gender Admit n prop
## <fct> <fct> <fct> <int> <dbl>
## 1 A Female Admitted 89 0.824
## 2 B Female Admitted 17 0.68
## 3 C Female Admitted 202 0.341
## 4 D Female Admitted 131 0.349
## 5 E Female Admitted 94 0.239
## 6 F Female Admitted 24 0.0704
We can see that the proportion of females admitted varies wildly between departments.
Within most departments, female applicants are more likely to be admitted.
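The table above shows only the female rates, so the comparison with males is not visible there. A sketch of the side-by-side comparison, assuming ucb_admit holds the same counts as base R's built-in UCBAdmissions table (an assumption; the notes do not say where ucb_admit comes from):

```r
# UCBAdmissions ships with base R as an Admit x Gender x Dept table

# Overall admission rate by gender (ignoring department)
by_gender <- margin.table(UCBAdmissions, margin = c(1, 2))
overall_rate <- prop.table(by_gender, margin = 2)["Admitted", ]
round(overall_rate, 3)

# Admission rate within each Gender x Dept cell
dept_rate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(dept_rate, 3)

# Departments where the female rate exceeds the male rate
female_higher <- dept_rate["Female", ] > dept_rate["Male", ]
names(female_higher)[female_higher]
```

Comparing overall_rate with dept_rate is what exposes the reversal: the overall gap favors males, while the department-level rates mostly do not.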
Simple random sampling. Example: randomly drawing names from a hat.
Stratified sampling. Example: if we wanted to make sure that people of low, medium, and high socioeconomic status are equally represented in a study, we would first divide our population into these three groups and then sample from within each group.
Cluster and multistage sampling are often used for economic reasons.
Example: one might divide a city into geographic regions that are on average similar to each other, and then sample randomly from a few randomly picked regions in order to avoid traveling to all regions.
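The two stages of such a design can be sketched in base R. The households data frame, the region labels, and the sample sizes below are all made up for illustration:

```r
set.seed(123)  # only so the illustration is reproducible

# Hypothetical city: 8 regions with 5 households each
households <- data.frame(
  household_id = 1:40,
  region = rep(paste0("region_", 1:8), each = 5),
  stringsAsFactors = FALSE
)

# Stage 1: randomly pick 3 of the 8 regions (the clusters)
picked_regions <- sample(unique(households$region), size = 3)

# Stage 2: randomly pick 2 households within each picked region
in_picked <- households[households$region %in% picked_regions, ]
multistage_sample <- do.call(
  rbind,
  lapply(split(in_picked, in_picked$region),
         function(d) d[sample(nrow(d), size = 2), ])
)

table(multistage_sample$region)
```

Keeping every household in the picked regions instead of sampling in stage 2 would give a plain cluster sample.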
# Load us_regions from a local .RData file (the path is specific to this machine)
us_regions <- get(load('D:/Downloads/us_regions.RData'))
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)
# Count states by region
states_srs %>%
  count(region)
## # A tibble: 4 x 2
## region n
## <fct> <int>
## 1 Midwest 4
## 2 Northeast 1
## 3 South 2
## 4 West 1
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)
# Count states by region
states_str %>%
  group_by(region) %>%
  count(region)
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
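Since us_regions lives in a local file, the stratified draw can also be tried on data that ships with R. The sketch below is an assumption-laden stand-in: it builds a states table from base R's state.name and state.region vectors (whose region labels differ slightly from those in us_regions), and it uses slice_sample(), the newer dplyr verb that supersedes sample_n() in dplyr 1.0.0 and later.

```r
library(dplyr)

# Stand-in for us_regions, built from datasets bundled with base R
states <- data.frame(state = state.name, region = state.region)

set.seed(2024)  # only so the illustration is reproducible

# Stratified sample: 2 states drawn at random within each region
states_str_alt <- states %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Every region contributes exactly 2 states
states_str_alt %>%
  count(region)
```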
A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.
There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).
'Explanatory' variables are conditions you can impose on the experimental units, while 'blocking' variables are characteristics that the experimental units come with that you would like to control for.
In random sampling, we use 'stratifying' to control for a variable. In random assignment, we use 'blocking' to achieve the same goal.
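Blocking can be sketched as random assignment carried out separately inside each block. The example below assumes 16 students, 8 per gender, and two levels for each explanatory variable; the data frame, level names, and sizes are all made up for illustration:

```r
set.seed(1)  # only so the illustration is reproducible

# Hypothetical experimental units with their blocking variable
students <- data.frame(
  id = 1:16,
  gender = rep(c("female", "male"), each = 8),
  stringsAsFactors = FALSE
)

# All combinations of the two explanatory variables (4 treatments)
treatments <- expand.grid(light = c("dim", "bright"),
                          noise = c("quiet", "loud"))
treatment_labels <- paste(treatments$light, treatments$noise, sep = "/")

# Randomize treatments separately within each gender block
students$treatment <- NA_character_
for (g in unique(students$gender)) {
  in_block <- students$gender == g
  students$treatment[in_block] <-
    sample(rep(treatment_labels, length.out = sum(in_block)))
}

# Each treatment appears equally often within each gender
table(students$gender, students$treatment)
```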